
Conversation

@iRakson (Contributor) commented Mar 6, 2020

What changes were proposed in this pull request?

A new function json_object_keys is proposed in this PR. The function returns all the keys of the outermost JSON object and takes a JSON object as its argument. It behaves as follows (see the sketch after this list):

  • If an invalid JSON expression is given, NULL is returned.
  • If an empty string or a JSON array is given, NULL is returned.
  • If a valid JSON object is given, all the keys of the outermost object are returned as an array.
  • For an empty JSON object, an empty array is returned.
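
A minimal spark-shell sketch of these cases (hypothetical inputs; the expected outputs in the comments follow the rules above):

```scala
spark.sql("""SELECT json_object_keys('{"key": "value"}')""").show()  // [key]
spark.sql("""SELECT json_object_keys('{}')""").show()                // []
spark.sql("""SELECT json_object_keys('[1, 2, 3]')""").show()         // null (JSON array, not an object)
spark.sql("""SELECT json_object_keys('')""").show()                  // null (empty string)
spark.sql("""SELECT json_object_keys('{"a": 1')""").show()           // null (invalid JSON)
```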

We can also get JSON object keys using map_keys+from_json (an equivalent query is sketched below), but json_object_keys is more efficient.
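
For illustration, a sketch of the two equivalent formulations; the from_json schema here is an assumption chosen for the example:

```scala
spark.sql("""SELECT json_object_keys('{"a": 1, "b": 2}')""").show()  // no schema needed
spark.sql("""SELECT map_keys(from_json('{"a": 1, "b": 2}', 'map<string,string>'))""").show()
```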

Performance result for json_object = {"a":[1,2,3,4,5], "b":[2,4,5,12333321]}

Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
JSON functions:                           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
json_object_keys                                  11666          12361         673          0.9        1166.6       1.0X
from_json+map_keys                                15309          15973         701          0.7        1530.9       0.8X

Why are the changes needed?

This function will help users directly extract the keys from a JSON string, and it is fairly intuitive as well. It also extends Spark SQL's functionality for JSON strings.

Some of the most popular DBMSs support this function.

  • PostgreSQL
  • MySQL
  • MariaDB

Does this PR introduce any user-facing change?

Yes. Users can now extract the keys of JSON objects using this function.

How was this patch tested?

UTs added.

@iRakson (Contributor, Author) commented Mar 6, 2020

@MaxGekk (Member) left a comment

@iRakson Could you show a use case of the function?

@MaxGekk (Member) commented Mar 6, 2020

We can get the same result using existing functions:

scala> val df = Seq("""{"a":1, "b":2}""").toDF("json")
df: org.apache.spark.sql.DataFrame = [json: string]

scala> df.select(map_keys(from_json($"json", MapType(StringType, StringType)))).show
+-----------------+
|map_keys(entries)|
+-----------------+
|           [a, b]|
+-----------------+

What's the benefit of having the function in public API?

@iRakson (Contributor, Author) commented Mar 6, 2020

What's the benefit of having the function in public API?

  • json_object_keys gives better performance than map_keys+from_json.
  • Users are more familiar with json_object_keys, as it is supported by some popular DBMSs, so they will find it easier to use.

@iRakson (Contributor, Author) commented Mar 10, 2020

@HyukjinKwon @cloud-fan @maropu
Please take a look at this as well.

@dongjoon-hyun (Member)

ok to test

@dongjoon-hyun (Member)

Do you know how much faster it is?

json_object_keys gives better performance than map_keys+from_json.

@SparkQA commented Mar 11, 2020

Test build #119637 has finished for PR 27836 at commit 17caff4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class JsonObjectKeys(child: Expression) extends UnaryExpression with CodegenFallback

@SparkQA commented Mar 11, 2020

Test build #119648 has finished for PR 27836 at commit 7a74b30.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@iRakson (Contributor, Author) commented Mar 17, 2020

retest this please

@HyukjinKwon (Member)

retest this please

@HyukjinKwon (Member)

@MaxGekk does it look good to you?

@maropu (Member) commented Mar 19, 2020

Do you know how much faster it is?

json_object_keys gives better performance than map_keys+from_json.

Has the @dongjoon-hyun comment above already been addressed? If this function has performance benefits, I think it's better to show the performance numbers in the PR description.

@iRakson (Contributor, Author) commented Mar 19, 2020

Do you know how much faster it is?

json_object_keys gives better performance than map_keys+from_json.

Has the @dongjoon-hyun comment above already been addressed? If this function has performance benefits, I think it's better to show the performance numbers in the PR description.

I have updated the timings for map_keys+from_json and json_object_keys in the PR description. I will also try to add benchmarks for all the new JSON functions.
@maropu @HyukjinKwon

@SparkQA commented Mar 19, 2020

Test build #120032 has finished for PR 27836 at commit 7a74b30.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@iRakson (Contributor, Author) commented Mar 21, 2020

@MaxGekk does it look good to you?

@MaxGekk Kindly review this.

var arrayBufferOfKeys = ArrayBuffer.empty[UTF8String]
// this handles `NULL` case
if (parser.nextToken() == null) {
throw new AnalysisException(s"$prettyName expect a JSON object but nothing is provided.")
Member

Is the exception thrown at the analysis phase? I think AnalysisException should be replaced by RuntimeException.

@MaxGekk (Member) commented Mar 21, 2020

The function works for literals, but I got an NPE on a column with JSON objects:

  test("json_object_keys") {
    spark.sql("""select json_object_keys('{"a": 1, "b": 2, "c": 3}')""").show(false)
    val df = Seq(
      """{"a":1}""",
      """{"a":1, "b":2}""",
      """{"a": 1, "b": 2, "c": 3}""").toDF("json")
    df.show(false)
    val dfKeys = df.selectExpr("json_object_keys(json)")
    dfKeys.show(false)
  }
+------------------------------------------+
|json_object_keys({"a": 1, "b": 2, "c": 3})|
+------------------------------------------+
|[a, b, c]                                 |
+------------------------------------------+

+------------------------+
|json                    |
+------------------------+
|{"a":1}                 |
|{"a":1, "b":2}          |
|{"a": 1, "b": 2, "c": 3}|
+------------------------+

06:28:11.089 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 3.0 (TID 3)
java.lang.NullPointerException
	at org.apache.spark.sql.catalyst.InternalRow$.$anonfun$getAccessor$16(InternalRow.scala:152)
...
Caused by: java.lang.NullPointerException
	at org.apache.spark.sql.catalyst.InternalRow$.$anonfun$getAccessor$16(InternalRow.scala:152)
	at org.apache.spark.sql.catalyst.InternalRow$.$anonfun$getAccessor$16$adapted(InternalRow.scala:151)
	at org.apache.spark.sql.catalyst.expressions.BoundReference.eval(BoundAttribute.scala:41)
	at org.apache.spark.sql.catalyst.expressions.JsonObjectKeys.json$lzycompute(jsonExpressions.scala:812)
	at org.apache.spark.sql.catalyst.expressions.JsonObjectKeys.json(jsonExpressions.scala:812)

@iRakson (Contributor, Author) commented Mar 22, 2020

The function works for literals, but I got an NPE on a column with JSON objects: […]

I fixed this issue and added a new test case covering it.
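
For context, the stack trace in the comment above points at a class-level lazy val (JsonObjectKeys.json$lzycompute), which evaluates the child expression once, without the per-row input, and then caches the result. A self-contained Scala sketch of the caching half of that pitfall (a generic illustration, not the actual Spark code):

```scala
// A lazy val captures its value on first use and never re-evaluates,
// so per-row state must not be cached in one.
class KeyReader(row: () => String) {
  lazy val cached: String = row()    // evaluated once, then frozen
  def evalCached(): String = cached  // returns the first row forever
  def evalPerRow(): String = row()   // re-evaluates on every call
}

object LazyValPitfall extends App {
  var current = "row1"
  val reader = new KeyReader(() => current)
  println(reader.evalCached())  // row1
  current = "row2"
  println(reader.evalCached())  // still row1: the stale cache
  println(reader.evalPerRow())  // row2
}
```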

@SparkQA commented Mar 22, 2020

Test build #120162 has finished for PR 27836 at commit ce18e41.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


override def eval(input: InternalRow): Any = {
try {
lazy val json = child.eval(input).asInstanceOf[UTF8String]
Member

Why do we need lazy here?

arguments = """
Arguments:
* json_object - A JSON object is required as argument. `Null` is returned, if an invalid JSON
string is given. `Runtime Exception` is thrown, if null string or JSON array is given.
Member

How about this?

      * json_object - A JSON object. If it is an invalid string, the function returns null.
              If it is a JSON array or null, a runtime exception will be thrown.

btw, is the error handling consistent with the other JSON functions?

Contributor (Author)

I checked the other json functions. IllegalArgumentException is thrown by from_json for invalid inputs and it is better than RuntimeException.
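
A minimal sketch of the check being discussed at this point (assumed shape and message; note that, per the PR description and a diff further below, the final version returns NULL for empty input instead of throwing):

```scala
import com.fasterxml.jackson.core.JsonFactory

// Reject empty/null input at evaluation time with a runtime exception,
// rather than an AnalysisException, since the error occurs while
// evaluating rows, not while analyzing the plan.
def requireJsonObjectInput(json: String): Unit = {
  val parser = new JsonFactory().createParser(json)
  try {
    if (parser.nextToken() == null) {
      throw new IllegalArgumentException(
        "json_object_keys expects a JSON object as input, but nothing was provided.")
    }
  } finally parser.close()
}
```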


test("json_object_keys") {
val df = Seq("""{"a": 1, "b": 2, "c": 3}""".stripMargin)
.toDF("json")
Member

nit: you don't need the line break here.

@SparkQA commented Mar 23, 2020

Test build #120213 has finished for PR 27836 at commit bac0b75.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 23, 2020

Test build #120215 has finished for PR 27836 at commit de7e02b.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

retest this please

@SparkQA commented Mar 24, 2020

Test build #120264 has finished for PR 27836 at commit de7e02b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@iRakson (Contributor, Author) commented Mar 27, 2020

gentle ping @HyukjinKwon @maropu @MaxGekk

-- !query schema
struct<>
-- !query output
java.lang.IllegalArgumentException
Member

Why do you throw IllegalArgumentException for '[1, 2, 3]' but return NULL for '{"key": 45, "random_string"}'? It looks slightly inconsistent, though both inputs are invalid.

Contributor (Author)

For invalid JSON strings, NULL is returned. '[1, 2, 3]' is a valid JSON string but not a JSON object, so it is an invalid argument.

Contributor (Author)

Regarding this issue, I am thinking of implementing a function ISJSON() (sketched below). It will return NULL for null, 1 for a valid JSON string, and 0 for an invalid JSON string.

This function is supported by most of the major DBMSs.

Currently we return NULL for invalid JSON.

In this PR, as well as in #27759, NULL is returned for null JSON strings, so the behaviour might be confusing, as we return NULL for invalid JSON strings as well.
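
A sketch of the proposed semantics (isjson is hypothetical here and not part of this PR; return values follow the description above):

```scala
spark.sql("""SELECT isjson('{"a": 1}')""")   // 1: valid JSON
spark.sql("""SELECT isjson('[1, 2, 3]')""")  // 1: valid JSON (an array)
spark.sql("""SELECT isjson('{"a": 1')""")    // 0: invalid JSON
spark.sql("""SELECT isjson(NULL)""")         // NULL
```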

@SparkQA commented Apr 6, 2020

Test build #120874 has finished for PR 27836 at commit e45e61c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member)

Hi, @iRakson. Could you resolve the conflicts?

("""{"key": 45, "random_string"}""", null),
("", null),
("[]", null),
("""[{"key": "JSON"}]""", null)
Member

Could you sort the test cases a bit more meaningfully, please?

("", null),
("[]", null),
("""[{"key": "JSON"}]""", null)
).foreach{
Member

nit. ).foreach{ -> ).foreach {

select json_object_keys(200);
select json_object_keys('');
select json_object_keys('{"key": 1}');
select json_object_keys('{}');
Member

Shall we switch lines 67 and 68?

val df = Seq("""{"a": 1, "b": 2, "c": 3}""").toDF("json")
val dfKeys = df.selectExpr("json_object_keys(json)")
checkAnswer(dfKeys, Row(Array("a", "b", "c")))
}
Member

We can revert this change from this PR since we have a superset already.

* A function which returns all the keys of the outmost JSON object.
*/
@ExpressionDescription(
usage = "_FUNC_(json_object) - returns all the keys of the outmost JSON object as an array.",
Member

nit. returns -> Returns?

parser => getJsonKeys(parser, input)
}
} catch {
case _: JsonProcessingException => null
@dongjoon-hyun (Member) commented Apr 7, 2020

Just a question: is there no need to handle IOException here?

Contributor (Author)

Yes, nextToken() throws IOException. I will update it with other changes.
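
A sketch of the broadened error handling (assumed shape; Jackson's JsonProcessingException already extends IOException, so catching IOException also covers parse errors):

```scala
import java.io.IOException
import com.fasterxml.jackson.core.{JsonFactory, JsonProcessingException}

// Map both malformed JSON and I/O failures from the parser to null.
def tryFirstToken(json: String): AnyRef =
  try {
    val parser = new JsonFactory().createParser(json)
    try parser.nextToken() finally parser.close()
  } catch {
    case _: JsonProcessingException | _: IOException => null
  }
```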

// return null if an empty string or any other valid JSON string is encountered
if (parser.nextToken() == null || parser.currentToken() != JsonToken.START_OBJECT) {
return null
}
Member

Shall we move this exception handling part (lines 849~851) out of getJsonKeys to line 839 (outside of this function)?

@iRakson (Contributor, Author) commented Apr 8, 2020

@dongjoon-hyun I resolved the conflicts and have tried to address all the review comments.
Kindly review.

@iRakson requested a review from @dongjoon-hyun on April 8, 2020 08:39
@SparkQA commented Apr 8, 2020

Test build #120963 has finished for PR 27836 at commit 7eba58c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 8, 2020

Test build #120964 has finished for PR 27836 at commit faa3571.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) left a comment

+1, LGTM. Thank you, @iRakson and all.
Merged to master for Apache Spark 3.1.0.

sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020

Closes apache#27836 from iRakson/jsonKeys.

Authored-by: iRakson <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@svmgarg commented May 31, 2023

What's the benefit of having the function in public API?

We don't need to define the schema with json_object_keys.
